ABSTRACT



An analysis of the evaluation instruments of clinical clerkships
from 54 medical schools was made. Instruments were classified as to
purpose, format, and skills measured.

One purpose of all forms was to evaluate learners. Most forms 
gathered data that could be incorporated into internship letters. Forms 
of 16 schools also provided feedback on progress to the students. 

Thirty-nine schools used a modified Likert format; a few schools
also had a check list of adjectives or short answer questions. Nearly
all instruments had some space for general comments.

The most frequently evaluated concepts and skills were "knowledge,"
"getting along well with others," "hard worker," "ability," "dignity," 
"history-taking," and "performance." 

Several principles of the design of evaluation instruments were
discussed. One of these was that the instrument should be part of an
evaluation system, and should evaluate the specific tasks and objectives
that have been identified in the first stages of the learning system.
Other principles were that the instrument should be similar to the
clinical skill, it should be derived from a content sampling map, it
should not be used for two purposes that have conflicting goals, and it
should be
reliable and valid. Several suggestions were made to increase the re- 
liability of clinical evaluation instruments. The use of comments as a 
replacement for the measurement of specific objectives and content samp- 
ling was discouraged. 

Although five skills were recommended to be included in clinical
evaluation instruments, and the influence of national Board examinations
was pointed out, it was recommended that the objectives being measured
be a function of the specific objectives and constraints of the local
institution.



A Survey of Evaluation Instruments Used in Clinical 
Clerkships in American Medical Schools 

J. C. Reid, Ph.D.
July 1974 

The evaluation of medical students' performance in clinical clerk- 
ships remains a major unsolved problem, even though several articles 
have reported efforts to evaluate clinical students at specific medical 
schools. (1-6) Evaluation is one part of a learning system. Although 
system designs differ somewhat, nearly all learning systems contain as 
components, the description of the existing system, specifications of 
objectives, execution of a task analysis, designing the instruction, con- 
ducting a formative evaluation, and revising the instruction. (7) 

The formative and summative evaluations of a clinical clerkship
should reflect the objectives (8) and task analyses (9-10) of that
clerkship. Accordingly, one should be able to understand other schools'
objectives and task analyses of clinical clerkships by analyzing their
evaluation forms. 

The purpose of this report is to analyze typical summative
evaluation instruments of clinical clerkships in America, and to summarize the
objectives or tasks being evaluated thereby. Since no single study sum- 
marizes either learning systems or typical evaluation forms of many medi- 
cal schools in the nation, the need for such a survey was plainly evident. 
It was assumed that although many forms might be used to evaluate a stu- 
dent in a clerkship, the most pertinent data for a clerkship would be re- 
corded on a single form for the student's permanent file. Finally, this 
report discusses the purposes of evaluation and some design and
measurement principles that could be used to improve current evaluation
practices.



The author appreciates the suggestions of Dr. Jack M. Colvill. Any
faults are the sole responsibility of the author. 

Dr. Reid is the Research Associate of the Evaluation Section, 
University of Missouri-Columbia Medical School. 
Evaluations can be made for several purposes: to evaluate
instruction, learners, or learning (11); to provide a data base from which
internship recommendations can be written; to give the student feedback
on his weaknesses and strengths; and to answer research questions.

Different purposes suggest that different items be included in the
evaluation form. If the intent is to evaluate instruction, then the
items on the evaluation instrument should largely describe the teaching,
demonstrations, instructional stimuli, clarity of objectives, and the
instructional environment. To describe the measurement of instruction
was not a purpose of this study; a few schools thoughtfully
sent forms, usually filled out by students, that did measure instruction,
but these forms are not included in the present study.

If the purpose is to evaluate learners, then the items on the
instrument are dictated by the behavioral objectives and by the results of
the task analysis of the clerkship. Since the required knowledge base 
differs across clerkships, it is not apparent how one single form can 
evaluate the knowledge of learners or learning for several different 
clinical clerkships, unless it is a summary sheet to which results from 
other tests are transcribed. On the other hand, the ability to collect 
data and to solve problems could perhaps be measured by similar instru- 
ments. The differing goals and constraints of various medical schools 
prevent a blanket adoption of one specific evaluation system across medi- 
cal schools. However, testing efforts by national Boards will unify ed- 
ucational objectives among medical schools to some degree. 

A third reason for evaluation is the measurement of learning.
Learning is commonly measured by a pretest-posttest design (12), although
in fact the evaluation of learning is not simple (13).

If the evaluation form is to generate data for internship letters,
then it probably should request some descriptive vignettes that
characterize the student, as well as data that will predict future success.
If the purpose of evaluation is to provide the student with feedback,
then the items should measure the objectives and provide direction to
facilitate improvement. The purpose of evaluation as a research tool 
will not be discussed in this report. 

Procedure 

In December 1973, a request for a copy of the forms used to
evaluate basic science and clinical medical students was sent to 98 schools
of medicine listed in the AAMC directory. By February 1974, replies 
had been received from 63 schools. Since the purpose of the study was 
to obtain typical ideas rather than to describe the sampling distributions 
of characteristics of medical evaluation forms, no attempts were made to 
increase the percent of replies beyond the 64% obtained by the first
request.

Of the 63 schools replying, 6 schools sent no evaluation form. 
These 6 schools typically indicated they used letter grades and/or a 
sheet of comments, and these 6 were not included in the analysis. Three 
schools sent only basic science forms. The present report is restricted 
to the analysis of the evaluation forms for clinical clerkship or single 
forms used for both clinical and basic science years that were sent by 
54 medical schools. Some schools had several forms and other schools 
had a form of several pages; these instances were counted as one form. 
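
The sample accounting in the two paragraphs above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the variable names are illustrative and are not the author's.

```python
# Counts reported in the Procedure section; variable names are illustrative.
schools_contacted = 98
schools_replying = 63
no_form_sent = 6          # used letter grades and/or a sheet of comments
basic_science_only = 3    # sent only basic science forms

response_rate = schools_replying / schools_contacted
schools_analyzed = schools_replying - no_form_sent - basic_science_only

print(f"response rate: {response_rate:.0%}")      # prints "response rate: 64%"
print(f"schools analyzed: {schools_analyzed}")    # prints "schools analyzed: 54"
```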

The first step in analyzing the forms was to determine the purpose
of the form. For the present study, the categories of evaluating instruc- 
tion and gathering data for research were ignored. It turned out that 
forms could not reliably be placed into the learning category, so that 
category was dropped. If the form indicated that comments could be made 
about the student, then it was judged as capable of generating character- 
istic vignettes for internship letters. Finally, if the form had a copy 
marked "student's copy," or if a phrase on the form indicated that
results could be shared with students, it was classified as capable of
providing feedback to students.

The second step was to determine the format of the form, whether it 
was a checklist, short answer, etc. The format was pertinent since it
would affect the specificity of the objectives or task analysis, and the 
ease of using data for the purpose the form was designed. 

The third step was to determine what tasks (skills) or concepts 
were being measured by the evaluation forms. A frequency count was made 
of all words on the form that related to evaluation or measurement of 
performance. 

Words were inspected for duality of meaning. For example, "rate"
(and the same root with suffix -d) could mean "grade," as in "how would 
you rate this student," or "degree of growth," as in "rate of progress." 
Thus, homonyms and homographs were sorted into different classes; syno- 
nyms were grouped into similar tasks or concepts. 
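
The third step can be sketched in code. This is a minimal illustration, assuming that terms are matched as phrases (so that the two senses of "rate" fall into different classes) and that the tally of interest is the number of forms mentioning each concept. The synonym table, the function names, and the example forms are all invented for illustration and are not data from the survey.

```python
from collections import Counter

# Hypothetical synonym table: each surface term maps to one concept.
# Homographs such as "rate" are resolved by phrase, not by word alone.
CONCEPT_OF = {
    "industrious": "hard worker",
    "lazy": "hard worker",
    "works well on a team": "gets along well with others",
    "acceptable to others": "gets along well with others",
    "how would you rate": "grade",
    "rate of progress": "degree of growth",
}

def concepts_on_form(form_text):
    """Return the set of concepts whose terms appear on one form."""
    text = form_text.lower()
    return {concept for term, concept in CONCEPT_OF.items() if term in text}

def schools_per_concept(forms):
    """Count, for each concept, how many forms mention it at least once."""
    counts = Counter()
    for form_text in forms:
        counts.update(concepts_on_form(form_text))
    return counts

# Invented example forms, not data from the survey.
forms = [
    "Is the student industrious? How would you rate this student?",
    "Works well on a team. Rate of progress during the clerkship.",
    "Lazy or apathetic? Acceptable to others?",
]
tally = schools_per_concept(forms)
```

Counting each concept once per form, rather than once per occurrence, matches a table that reports the number of schools using a concept.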

Few medical school forms evaluated specific tasks. Most rather
evaluated more global concepts. Although lessened specificity is
undesirable for both instructional and measurement reasons, the concepts
were content analyzed along with the few tasks that were described. The
analysis of concepts suggested classes of tasks that typified medical
school concern.

Results 

The first result discussed is the purpose of the clinical
evaluation forms. If the evaluation was used only to evaluate learners, that
is, to grade the student, then there would be no need to give the
student detailed feedback on how he performed. On the other hand, if the
evaluation was also intended to improve the student's performance or
direct him into areas appropriate to his strengths, then the evaluation
system should in addition provide the student with detailed feedback on
his weaknesses and strengths. Every one of the 54 clinical forms
evaluated learners, that is, determined if a learner "passed." All 54 forms
included a space for remarks or comments, and were therefore judged
capable of providing some characterizing vignettes, so that faculty
could write descriptive internship recommendations. In addition to
evaluating learners and providing information for internship
recommendations, thirteen schools used their evaluation to improve the students'
performance by directing their attention to specific strengths and
weaknesses, since they gave students a copy of the evaluation, or made a
copy readily available. Three schools apparently made the feedback
optional, as their evaluation sheets carried a question like: "Check if this
evaluation was discussed with the student: yes no." Forms (or
instructions) from four schools indicated that students in danger of failing a
clerkship had extensive counsel made available to them. This practice
is probably fairly common. The remaining evaluation forms did not
mention whether the student received any information from the
evaluation other than a grade or a pass-fail note.

The second result of the analysis of the 54 clinical forms concerns 
the format of the instrument. Thirty-nine medical schools used a
modified Likert format for their evaluation form. A Likert format consists
of a series of phrases or statements each rated on a "strongly agree,
agree, . . . , strongly disagree" scale. Four examples of modified
Likert formats are in Table 1. Several instruments also had an adjective
check list.



Insert Table 1 about here 



Fifteen schools had only short answer questions, i.e., rating scales.

The final result described will be the frequency of tasks or con- 
cepts from the content analysis of the 54 evaluation forms. Several 
hundred terms were combined into 90 tasks or concepts: 24 dealing with
knowledge, 6 with interpersonal relations, 31 with personality traits,
24 with specific skills, and 5 with miscellaneous.

Table 2 groups the most frequently used concepts by frequency of
occurrence in different schools. The most frequently evaluated tasks or
concepts were "academic" or "knowledge"; nearly all forms requested the
rater to comment specifically on the student's knowledge or
understanding.



Insert Table 2 about here


After knowledge, the next two most frequently used concepts were 
"gets along well with others" and "hard worker." The concept of getting 
along well was expressed variously: "acceptable to others," "works well 
on a team," and "liked by coworkers and hospital personnel" were common
expressions. The phrases "human relations" and "personality" may partly 
bear on this concept. Being a "hard worker" was expressed by words rang- 
ing from "industrious" and "does more than his share of work" to "lazy" 
and "apathetic." 

The next most frequently evaluated skills or concepts were "ability," 
"dignity," "history taking," and "performance." The concepts of ability 
and achievement or knowledge may overlap (14-15). The somewhat less 
specific words of "behavior," "bearing," "emotional maturity," and 
"manner" probably relate for the most part to the concept of dignity. 
The skill of what some schools term "history taking" may be part of what 
others describe as "performance." Within the label of history taking is
the task of keeping charts and records. 

These seven concepts complete those used by at least two-thirds of 
the responding medical schools. Of the eight concepts reported by 20 to 
29 medical schools (about half of the schools responding), the most
important is probably conducting a physical examination. 

The concepts mentioned by about 20 to 40% of the responding medical 
schools, listed in Table 2 under the 10-19 heading, also are used to de- 
scribe an ideal medical student. Such a medical student seems to be 
prompt and present rather than absent. When he is present, he
participates. Some schools expect the ideal student to be neat in appearance.
He asks questions (but, admonish some forms, asks respectfully), accepts
criticism well, and is willing to do what is asked. He communicates
well in spoken and written language, and he presents data or cases well.
This skill in communication may stem partly from the fact that he is
well organized, he reads and uses the library, he has knowledge of facts
learned in basic science, and he synthesizes ("correlates") and applies
knowledge well. He does well at management of the patient, patient care,
and patient problems. Probably most important, he is good at analyzing
and solving problems.

Terms that appeared only infrequently can be pointed out. These
include adaptability, anxiety, being aware of patient change, being aware
of economic factors (costs) and outside agencies, discernment (this
concept may be subsumed within others discussed above), having a sense of
humor, being prepared (specifically mentioned by only one school), and
the two skills of following up and listening, both quite important.

Discussion 

Six purposes of evaluation were described. No form was classified
as evaluating learning or instruction, or as being used for research
purposes, for the reasons given above.

Evaluation instruments for learners should approximate the specific
skills being measured. An evaluation may be potentially destructive to
the educational process, particularly if students study for the
examination rather than for the acquisition of the skill. Consequently several
studies have used simulation techniques, actors as patients, and special
testing methods (3, 16-28) so that the examination closely approximates
the performance. Further support for the idea that examination should
resemble performance is lent by reports that physician performance is
predicted better by length and type of internship and residency than by
medical school ratings. (29, 30) As the evaluation format approaches the
actual clinical skill in similarity, students increase their efforts
toward acquiring the skill, and decrease sycophantic behavior toward the
persons whom they perceive will subjectively rate them.

The measurement of learning necessarily requires assessments at two 
or more points over time. None of the 54 clinical forms were classified 
as measuring learning because such classification could not be done re- 
liably. Nevertheless, it seemed that the great majority of the forms 
measured learners, and few measured learning. Surely the evaluation of 
learning would be a priori as important as the evaluation of learners.
David P. Ausubel wrote on the first page of one of his books, ". . . the
most important single factor influencing learning is what the learner
already knows. Ascertain this and teach him accordingly." (31) Several
evaluation forms called for a subjective estimate of students' "growth," 
which is probably a request for the evaluation of learning. To be useful 
and reliable, learning should be measured objectively, not subjectively. 
Again, instead of a subjective estimate of students' "potential," a more 
reliable, objective estimate might be obtained from a discriminant analysis, 
comparing scores of present students with scores on that same instrument of
previous students who later became satisfactory and satisfied physicians in
a certain career. It is possible of course that a subjective rating of 
learning on an evaluation form might in actuality be a summary of objective 
scores. 

If the purpose of the summative evaluation form is to provide data
from which internship recommendations will be written, then the
evaluation might result from compiling critical incidents (6, 26, 32-41)
derived from faculty or physician perceptions which are then categorized
by content area. The construction of a content sampling map will assure
that all important objectives are discussed in their proper proportion.
It should be emphasized that getting a "wide variety of important topics"
(42) does not satisfy the requirements of content sampling in terms of 
objectives and tasks. 

If the goal is to give the student feedback on his strengths and 
weaknesses, then the items need to be translated back into behavioral 
objectives so the student can readily tell which objectives he achieved 
and which he did not. 

If a single evaluation form is to have several purposes, then it
probably will have several sections, each of which is processed
differently. A single instrument that has two purposes (such as the
evaluation of learners and the providing of student feedback) may serve neither
of them well. Years ago the army discovered that captains were reluctant
to give lieutenants low ratings, or to indicate a poor performance,
because the lieutenants would find out who had rated them low. One attempt
to overcome this was described by Sisson (43). If an evaluation form is
used for the two purposes of evaluation of learners and for student
feedback, then medical faculty may not make wholly honest ratings, and
students who would benefit from frank counsel may never receive it.

Before closing the discussion it is well to emphasize two important
principles of measurement, reliability and validity. If a test
measuring knowledge is reliable, then if Pete scored high in knowledge on
Monday, he will also score high in knowledge on Wednesday. Good
measurement suggests that whatever is measured be measured reliably. Five
principles affecting reliability were often violated on the evaluation
forms. The first principle is specificity. Vague concepts such as
"personality," "human relations," "habits," "bearing," "manner," and
"social" cannot be measured reliably by ratings, since Dr. Jones's notion
of "personality" differs from Dr. Smith's. Forms using these vague
terms will produce unreliable data. "Ability," "performance," and
"skills" are useful concepts, but are wholly unsatisfactory when used as
stems for single-item rating scales.
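
The Monday-Wednesday example above describes what is usually called test-retest reliability. One common way to quantify it, assumed here since the text does not name a coefficient, is the Pearson correlation between scores from the two administrations. The scores below are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented knowledge scores for five students tested on two days.
monday = [82, 67, 91, 74, 58]
wednesday = [80, 70, 93, 71, 60]

# A value near 1.0 means students kept roughly the same rank order,
# i.e. those who scored high on Monday also scored high on Wednesday.
r = pearson_r(monday, wednesday)
```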

The second principle is that measurement should be as objective as 
possible. It is surprising that knowledge, which is not difficult to
measure by multiple-choice tests, would so frequently occur as a
subjective estimate. A subjective rating is probably the least reliable
measure of knowledge; nevertheless, several evaluation forms required
faculty members to indicate their subjective appraisal of the amount of
someone's knowledge. A more reliable measure of knowledge than
subjective rating is an objective measurement: a written test, or the
performance of a standard clinical task before a trained observer, or the
student's reaction to a simulated stimulus. (4) In fact, relevant objective
measures should be obtained instead of subjective ratings wherever
possible. (44, 45) It is possible that the "knowledge" scale on some
evaluation forms is a summary index of students' weighted total scores on
objective knowledge instruments. Two or three schools enclosed samples of
their objective instruments to measure knowledge. It is important to
recognize that the following statement is not true: "any method producing
more objectivity with respect to the evaluation of medical students pro- 
vides measurable advantages, . . ." (46, p. 345). Objectivity is better 
only if it is pertinent to the behavioral objectives and the results of 
the task analysis. Height of students is an objective measure, but it is 
not relevant to clinical tasks. 

Third, a person should not rate a quality or skill unless he has 
observed it first hand. (41, 47) Courts recognize this recommendation by 
distinguishing between acceptable evidence and hearsay. Some rating
forms have a column "don't know" or "no chance to observe." Several
evaluation forms correctly instructed the rater not to evaluate any trait
or skill that he had no personal knowledge of. 

Fourth, questions or scales can be most reliably responded to if 
each scale measures only one trait. Thus a more reliable form would have
a format such as example I, III or IV in Table 1, and would not have a
format such as example II in Table 1, in which several concepts are
represented by a single scale, and the anchors (descriptors) at one end of the
scale are not opposites in meaning of the anchors at the other end of the
scale. An exception would occur if reliable factor analyses indicated
that different scales loaded highly on the same trait.

Fifth, the skills being measured should have been previously well de- 
fined, and evaluation be limited to that well-defined set. (2) An evalu- 
ation instrument, after it has passed the developmental stage, should con- 
sist of scales or items that reflect the well-defined content being mea- 
sured, rather than consist of large blank spaces for unfettered English 
comments. The use of printed scales will assure that the agreed-upon ob- 
jectives are being considered by raters, and a weighted score can be 
quickly computed and compared against previously agreed-upon criteria
of pass-fail or A, B, C, D. Although it is meritorious to measure
unobtrusively where possible (48), particularly in the affective domain,
the use of patient records to evaluate clinical performance has not been 
satisfactory (36, 49) probably because of the wide variability in record 
keeping. 
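
The weighted score mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the scale names, the weights, the 1-to-5 rating range, the cutoffs, and the "F" label for scores below D are all hypothetical, and each institution would set its own from its objectives and task analysis.

```python
# Hypothetical scale names, weights, and cutoffs; none come from the survey.
WEIGHTS = {
    "knowledge": 0.30,
    "data collection": 0.25,
    "problem solving": 0.25,
    "relationships": 0.10,
    "commitment": 0.10,
}
# (minimum weighted score, grade), highest cutoff first.
CUTOFFS = [(4.5, "A"), (3.5, "B"), (2.5, "C"), (1.5, "D")]

def weighted_score(ratings):
    """Weighted mean of 1-5 ratings; every printed scale must be rated."""
    missing = set(WEIGHTS) - set(ratings)
    if missing:
        raise ValueError(f"unrated scales: {sorted(missing)}")
    return sum(WEIGHTS[scale] * ratings[scale] for scale in WEIGHTS)

def grade(score):
    """Map a weighted score onto the previously agreed-upon criteria."""
    for minimum, letter in CUTOFFS:
        if score >= minimum:
            return letter
    return "F"  # hypothetical label for scores below the D cutoff

# Invented ratings for one student.
ratings = {
    "knowledge": 4,
    "data collection": 5,
    "problem solving": 3,
    "relationships": 4,
    "commitment": 5,
}
score = weighted_score(ratings)   # approximately 4.1
result = grade(score)             # "B" under these hypothetical cutoffs
```

Requiring a rating on every printed scale is a design choice of this sketch. A form that offers a "no chance to observe" column would instead need a rule for renormalizing the weights over the scales actually rated.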

The widespread use of comments is difficult to understand when it is
generally acknowledged that comments are time-consuming for faculty to
make and are time-consuming to analyze and evaluate. Although comments
are widely used in the developmental stages of an instrument, they are
typically not retained in the final revision.

Comments could perhaps be justified in a final version of the
instrument in three instances. First, comments might record highly unusual
events (e.g., "student is deaf but nevertheless achieving satisfactorily").
Second, if a copy of the instrument is to be given to the student, then
comments could be included to advise a particular student, e.g.,
"review Peters Chapter 2, particularly the 2nd and 3rd sections." Third,
comments may provide descriptive vignettes for an internship
recommendation letter. Again, higher reliability would result if these vignettes
could be related to a content map of agreed-upon critical incidents
that described unacceptable or acceptable practices.

Other than these three instances, comments should not be encouraged. 
A student will learn as much by seeing himself checked high on a scale 
of "hard-working and industrious" as he will by seeing a hand-written
comment, "Joe is hard-working and industrious." A faculty member might
not write that phrase in longhand, even though it might be appropriate.

A second important measurement principle is validity. A validity 
index will indicate the degree to which the instrument measures what it
should. Presumably, if the instrument has been designed as part of a
learning system as outlined herein, the instrument will have at least 
respectable content and construct validity, although this presumption 
should be tested rather than assumed. In general, measures of clinical 
evaluation have not predicted physician performance, partly because few 
clinical performance instruments are either reliable or valid, partly 
because it is difficult to decide what constitutes satisfactory
physician performance, and partly because, as Price, Taylor, and others point
out, physician performance is not unidimensional. (50, 59) Some Boards have made
an effort to improve the reliability and validity of their assessment 
techniques (23, 26, 30, 35, 42, 60-64). 

The results of this survey combined with the published literature 
suggest that the following dimensions are important in the measurement 
of clinical competence: knowledge about disease, ability to collect data 
(including physical examinations and taking histories), ability to
identify and solve problems, maintenance of an appropriate relationship with
patients and colleagues, and commitment to get the job done (a
combination of hard work and efficiency).

Nearly every school in the survey stressed the importance of
knowledge. The evaluation of knowledge should be based on the task analysis
and the behavioral objectives, and should be objective rather than
subjective. The ability to collect data also appeared on many evaluation
instruments; relevant literature is cited elsewhere in this survey.
Data collection skills have been measured by observation and critical 
incidents techniques; such skills should be rated objectively against 
pre-specified criteria (2, 34, 65).

Several schools did not explicitly measure problem-solving ability, 
yet this seems to be a crucial trait for successful physicians to have. 
Although important work has been done by Rimoldi, Elstein, McGuire, and 
others as cited herein, in general the large literature on the measure- 
ment and evaluation of problem-solving is not apparently implemented in 
clinical settings. 

The maintenance of an appropriate relationship with patients and 
coworkers, and a commitment to get the job done occurred quite frequently 
in the present sample of clinical evaluation instruments and are supported
in the literature (4, 34, 65, 66). Sociometric ratings and peer ratings 
on checklists will produce more reliable results than will random obser- 
vations. 

The purpose of this study was to describe some common existing evalu- 
ation processes of clinical blocks of American medical schools. The co- 
operation of those schools participating in the study is greatly appreci- 
ated since both the satisfactory and the unsatisfactory practices helped 
to clarify and sharpen issues that were not universally recognized in the 
medical evaluation literature. By reporting the clinical evaluation 
processes in over fifty medical schools, this study should make it easier
for an institution to take the first step in a learning system, that of
analyzing its existing system. Sufficient evidence has been cited to
demonstrate that those who design evaluation instruments independently 
of a learning system that encompasses steps of specifying objectives and 
analyzing tasks will fail. This study has reported 5 clinical competen- 
cies that appear to be regarded as basic, as well as numerous ancillary 
concepts. Objectives of institutions (1, 2, 4-6) and of national Boards
may also be useful, but an instrument should reflect the objectives and
constraints of a local institution and not someone else's.